The article presents Fast-dLLM, a method for accelerating diffusion-based large language models (LLMs) through a block-wise Key-Value (KV) cache and a confidence-aware parallel decoding strategy. The approach improves inference throughput by up to 27.6x with minimal accuracy loss, making diffusion LLMs competitive with autoregressive models and demonstrating their potential for practical, real-world deployment.
Keywords: diffusion, language models, acceleration
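To make the confidence-aware parallel decoding idea described above more concrete, here is a minimal sketch in Python. It assumes a hypothetical `model` callable that returns per-position logits over the whole sequence and that the block-wise KV cache for already-finalized blocks is handled inside the model; the function name, threshold value, and helpers are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def decode_block_parallel(model, tokens, block_start, block_end,
                          mask_id, threshold=0.9, max_steps=32):
    """Iteratively fill masked positions in one block, committing in parallel
    every position whose top-token confidence exceeds `threshold`.

    Assumptions (hypothetical interface):
      - `tokens` is a 1-D LongTensor of token ids, with `mask_id` at undecoded slots.
      - `model(tokens)` returns logits of shape [seq_len, vocab_size] and reuses
        a block-wise KV cache for earlier, already-finalized blocks internally.
    """
    for _ in range(max_steps):
        # positions in the current block that are still masked (block-relative)
        masked = (tokens[block_start:block_end] == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break  # block fully decoded

        logits = model(tokens)                                  # [seq_len, vocab]
        probs = torch.softmax(logits[block_start:block_end], dim=-1)
        conf, pred = probs.max(dim=-1)                          # confidence and argmax per position

        # accept, in parallel, all masked positions above the confidence threshold
        accept = masked[conf[masked] >= threshold]
        if accept.numel() == 0:
            # fall back to committing the single most confident masked position
            accept = masked[conf[masked].argmax().unsqueeze(0)]

        tokens[block_start + accept] = pred[accept]
    return tokens
```

In this sketch, decoding cost drops because several tokens per forward pass are committed whenever the model is sufficiently confident, while low-confidence positions stay masked and are revisited in later iterations; the thresholded acceptance is what trades a small amount of accuracy for the reported throughput gains.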